The imputeORS package provides tools to impute missing values in the Occupational Requirements Survey (ORS) administered by the U.S. Bureau of Labor Statistics (BLS), using an iterative, machine-learning-based approach. The package contains functions to retrieve, process, and analyze the data, and produces useful visualizations of both the iterative process and the final imputed results. In this vignette, we walk through the full analysis stream for a subset of the data.
Occupational Requirements Survey. The Bureau of Labor Statistics (BLS) collects a wide variety of data to capture important information regarding the workforce in the United States. One of the tools it uses to do this is the Occupational Requirements Survey (ORS), a comprehensive survey that gathers information regarding the various demands and conditions (requirements) of different occupations within the U.S. economy. These requirements fall into one of four categories: physical demands; environmental conditions; education, training, and experience; and cognitive and mental requirements. The survey is conducted in repeated waves across several years; the initial data analyzed were the publicly available datasets known as “Wave 1” and “Wave 2,” specifically Wave 1, year 2 and Wave 2, year 1. In this vignette, we focus exclusively on a subset of the Wave 1 data.
Structure of ORS Data. The data used in this vignette are publicly available from the BLS ORS, and are included within the imputeORS package. Importantly, this survey captures the distribution for the minimum job requirements of various occupations, rather than the actual qualifications of individuals working in these occupations (i.e., the survey does not capture information of under- and/or over-qualified employees). Additionally, this dataset covers only civilian occupations.
Numerous types of data are captured in the ORS, so we focus on the information we identified as relevant to our goal of imputing missing estimates for population distributions of occupational requirements. For any given observation, the relevant fields utilized from the ORS were as follows:
Character fields:

- occupation_text: Descriptive text of occupation
- data_element_text: Descriptive text of job requirement/demand
- data_type_text: Descriptive text of levels associated with any given requirement/demand
- estimate_type_text: Describes type of measurement associated with observation (either “Estimate” or “Standard Error”)
- estimate_type: Further describes type of measurement associated with observation (our focus was on observations described as “Percent”)
- unit_of_measure: Describes units of measurement associated with observation (our focus was on observations described as “Percentage”)

Numeric fields:

- upper_soc_code: Classifies occupation using the Standard Occupational Classification (SOC) system
- additive_group: Exhibits a one-to-one correspondence with data_element_text (i.e., requirement). Importantly, estimates of observations belonging to a given occupation_text within a specific additive_group should sum to 100%; we refer to these distinct combinations of occupation and requirement as occupational groups moving forward
- value: Measurement estimate (our focus was on percentage values describing mean distributions of requirements)

The preprocessing stage involves a number of steps, detailed below.
Raw Data and Transform. First, we obtain the raw data from the ORS website and transform it into the format used by the package. A sample of this transformed data is shown below.
#> upper_soc_code occupation_text
#> 751 11101100 Chief Executives
#> 3000 11303100 Financial Managers
#> 10001 13115100 Training and Development Specialists
#> 50000 31909100 Dental Assistants
#> 65001 39909900 Personal Care and Service Workers, All Other
#> 70000 43301100 Bill and Account Collectors
#> data_element_text data_type_text estimate_type_text
#> 751 Post-employment training YES Estimate
#> 3000 Lifting/carrying Occasionally NEGLIGIBLE Standard error
#> 10001 Pre-employment training: Certification NO Estimate
#> 50000 Post-employment training YES Standard error
#> 65001 Sitting 90th Percentile Estimate
#> 70000 Near visual acuity YES Standard error
#> estimate_type unit_of_measure reference_group additive_group value
#> 751 Percent Percentage 14 14 49.7
#> 3000 Mode Percentage 25 NA 9.5
#> 10001 Percent Percentage 89 NA 29.7
#> 50000 Percent Percentage 14 14 1.6
#> 65001 Percentile Percent of day 78 NA 70.0
#> 70000 Percent Percentage 52 52 -6.0
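One structural property worth keeping in mind is the additive-group constraint: within an occupational group (a distinct occupation_text/additive_group pair), the “Percent” estimates should sum to 100%. The sketch below checks this on hypothetical toy data in base R, not on the package's objects.

```r
# Toy data mimicking the relevant ORS fields (values are made up)
toy <- data.frame(
  occupation_text = c(rep("Chief Executives", 3), rep("Dental Assistants", 2)),
  additive_group  = c(14, 14, 14, 15, 15),
  value           = c(49.7, 30.3, 20.0, 62.5, 37.5)
)

# Sum the estimates within each occupational group
group.sums <- aggregate(value ~ occupation_text + additive_group,
                        data = toy, FUN = sum)

all(abs(group.sums$value - 100) < 1e-8)  # TRUE for consistent groups
```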
Synthetic Additive Groups and Filtering. Next, on the advice of SSA analysts, we manually create a handful of synthetic additive groups from the available data. The numbers assigned to these synthetic additive groups are derived from the reference_group field associated with the relevant observations. After creating these groups, we trim the data to include only observations associated with a valid additive group. This is all accomplished using the function syntheticAddGroups(), below.
x <- syntheticAddGroups(ors.from.csv)
Missing Observations and Corrections. Once the data is trimmed, we then go about the task of “completing” the data. One of the gaps in the data is the fact that some observations within an occupational group are unlisted (i.e., not only is the estimate missing, but the observation itself is not in the original data). To fix this issue, we generate all possible levels within a given occupational group, and then merge these observations with those available in the original dataset to produce a “complete” dataset. This is accomplished using the fillMissingObservations() and otherMissingObservations() functions. Note, because these additional observations do not have estimates associated with them, they become part of the imputation problem moving forward. Some additional corrections are also applied here (see function dataCorrections() for more details).
x.complete <- fillMissingObservations(x)
x.complete <- otherMissingObservations(x.complete)
x.complete <- dataCorrections(x.complete)
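The completion idea can be illustrated with a small sketch on toy data (assumed logic; the package functions handle many additional details): enumerate every possible level within a group, then left-join the observed estimates so that unlisted levels appear as rows with missing values.

```r
# Observed data list only the YES level of a binary requirement
observed <- data.frame(occupation_text = "Accountants",
                       data_type_text  = "YES",
                       value           = 71.2,
                       stringsAsFactors = FALSE)

# Enumerate all possible levels per occupation, then left-join
all.levels <- expand.grid(occupation_text = unique(observed$occupation_text),
                          data_type_text  = c("YES", "NO"),
                          stringsAsFactors = FALSE)
complete <- merge(all.levels, observed, all.x = TRUE)

# The NO row now exists with value NA and joins the imputation problem
complete
```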
Separate Means and Errors. In this step we separate the complete data into two data frames: one exclusively containing mean estimate observations, and another exclusively containing error observations. These are merged in a later step such that rather than having separate observations for the estimate and error for a given measure, they are assigned to the same observation (wide data rather than long data).
x.complete.means <- getMeanEstimates(x.complete)
x.complete.errors <- getErrors(x.complete)
Direct Calculation. Some of the observations with missing estimates (including some of those generated by data completion) can be filled in from the available information. For example, some requirements are binary in nature; e.g., the requirement Prior work experience has only two levels: YES and NO. Given one estimate, the other can be calculated. Similarly, for requirements where all but one estimate is known, the final one can easily be determined. Each occupational group is assessed to determine whether it is such an “N-1 group,” and those that are have their missing estimate calculated and added to the data.
ors.data <- fillNminusOneGroups(x.complete.means)
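The N-1 rule reduces to simple arithmetic. A minimal sketch (fill.n.minus.one is a hypothetical helper written for illustration, not the package function):

```r
# If exactly one estimate in an occupational group is missing, it must
# equal 100% minus the sum of the known estimates
fill.n.minus.one <- function(values) {
  missing <- is.na(values)
  if (sum(missing) == 1) values[missing] <- 100 - sum(values[!missing])
  values
}

fill.n.minus.one(c(YES = 71.2, NO = NA))   # NO becomes 28.8
fill.n.minus.one(c(10, NA, NA))            # two unknowns: left untouched
```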
Assign Errors, Bounds, Predictors, and Weights. In this step, we format the data for modeling. We first assign the errors to their relevant observations, and add bounds to each estimate based on the errors (where available). For observations missing an estimate/error, the bounds were calculated based on the remaining available percentage within the occupational group. We also add predictor columns, derived from the requirement categories, requirements, and requirement levels associated with each observation. This mapping is stored in a variable called predictors.data, a sample of which is shown below. Finally, we add default modeling weights to each observation, based on the presence/absence of an estimate.
#> data_element_text data_type_text additive_group
#> 11 Pre-employment training YES 12
#> 12 Prior work experience NO 13
#> 13 Prior work experience YES 13
#> 14 On the Job Training NO 14
#> 15 On the Job Training YES 14
#> 16 Sitting vs. standing/walking at will NO 15
#> 17 Sitting vs. standing/walking at will YES 15
#> 18 Reaching overhead NOT PRESENT 16
#> 19 Reaching overhead OCCASIONALLY 16
#> 20 Reaching overhead SELDOM 16
#> Requirement Frequency Intensity
#> 11 Pre-employment training 100 100
#> 12 Prior work experience 0 100
#> 13 Prior work experience 100 100
#> 14 On the Job Training 0 100
#> 15 On the Job Training 100 100
#> 16 Sitting vs. standing/walking at will 0 100
#> 17 Sitting vs. standing/walking at will 100 100
#> 18 Reaching 0 100
#> 19 Reaching 33 100
#> 20 Reaching 2 100
#> Requirement_category
#> 11 SVP
#> 12 SVP
#> 13 SVP
#> 14 SVP
#> 15 SVP
#> 16 PHY
#> 17 PHY
#> 18 PHY
#> 19 PHY
#> 20 PHY
ors.data <- errorsAndBounding(x.complete.errors,ors.data)
ors.data <- getPredictors(ors.data)
ors.data <- setDefaultModelingWeights(ors.data)
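Judging from the data snapshot shown below, the bounds for known estimates are value ± one standard error, and missing estimates get [0, pct.remaining]. The sketch assumes exactly that rule; errorsAndBounding() may apply further adjustments.

```r
# Assumed bounding rule, consistent with the data snapshot in this section
add.bounds <- function(value, std.error, pct.remaining) {
  if (!is.na(value)) {
    c(lower = max(0, value - std.error), upper = min(1, value + std.error))
  } else {
    c(lower = 0, upper = pct.remaining)
  }
}

add.bounds(0.313, 0.031, NA)   # known estimate: bounds 0.282 to 0.344
add.bounds(NA, NA, 0.040)      # missing estimate: bounds 0.000 to 0.040
```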
Note that here we remove all observations associated with the occupation_text All Workers, and we would advise doing the same, because it is a summary measure of all occupations considered in the ORS. For the purposes of this vignette only, we consider just four of the 22 possible 2-digit SOC codes. We include Table 2.1, listing the number of observations associated with each of the SOC2 codes retained. A guide to these codes can be found online at the BLS website. We also provide a snapshot of the data.
ors.data <- droplevels(ors.data[-which(ors.data$occupation_text=="All Workers"),])
ors.data <- droplevels(ors.data[which(ors.data$upSOC2 %in% c("11","13","15","19")),])
| SOC2 code | Count |
|---|---|
| 11 | 5916 |
| 13 | 5508 |
| 15 | 3672 |
| 19 | 1632 |
#> occupation_text upper_soc_code data_element_text
#> 1 Accountants 13201101 SVP
#> 2 Accountants 13201101 SVP
#> 3 Accountants 13201101 SVP
#> 4 Accountants 13201101 SVP
#> 5 Accountants 13201101 SVP
#> 6 Accountants 13201101 SVP
#> 7 Accountants 13201101 SVP
#> 8 Accountants 13201101 SVP
#> 9 Accountants 13201101 SVP
#> 10 Actuaries 15201100 SVP
#> 11 Actuaries 15201100 SVP
#> 12 Actuaries 15201100 SVP
#> 13 Actuaries 15201100 SVP
#> 14 Actuaries 15201100 SVP
#> 15 Actuaries 15201100 SVP
#> 16 Actuaries 15201100 SVP
#> 17 Actuaries 15201100 SVP
#> 18 Actuaries 15201100 SVP
#> 19 Administrative Services Managers 11301100 SVP
#> 20 Administrative Services Managers 11301100 SVP
#> data_type_text additive_group value std.error pct.remaining
#> 1 > 1 MONTH, <= 3 MONTHS 10 NA NA 0.040
#> 2 > 1 YEAR, <= 2 YEARS 10 0.017 0.005 NA
#> 3 > 2 YEARS, <= 4 YEARS 10 0.313 0.031 NA
#> 4 > 3 MONTHS, <= 6 MONTHS 10 NA NA 0.040
#> 5 > 4 YEARS, <= 10 YEARS 10 0.630 0.040 NA
#> 6 > 6 MONTHS, <= 1 YEAR 10 NA NA 0.040
#> 7 >SHORT DEMO, <= 1 MONTH 10 NA NA 0.040
#> 8 OVER 10 YEARS 10 NA NA 0.040
#> 9 SHORT DEMO ONLY 10 NA NA 0.040
#> 10 > 1 MONTH, <= 3 MONTHS 10 NA NA 1.000
#> 11 > 1 YEAR, <= 2 YEARS 10 NA NA 1.000
#> 12 > 2 YEARS, <= 4 YEARS 10 NA NA 1.000
#> 13 > 3 MONTHS, <= 6 MONTHS 10 NA NA 1.000
#> 14 > 4 YEARS, <= 10 YEARS 10 NA NA 1.000
#> 15 > 6 MONTHS, <= 1 YEAR 10 NA NA 1.000
#> 16 >SHORT DEMO, <= 1 MONTH 10 NA NA 1.000
#> 17 OVER 10 YEARS 10 NA NA 1.000
#> 18 SHORT DEMO ONLY 10 NA NA 1.000
#> 19 > 1 MONTH, <= 3 MONTHS 10 NA NA 0.189
#> 20 > 1 YEAR, <= 2 YEARS 10 NA NA 0.189
#> upper_bound lower_bound upSOC2 upSOC3 upSOC4 requirement frequency intensity
#> 1 0.040 0.000 13 132 1320 SVP 100 4
#> 2 0.022 0.012 13 132 1320 SVP 100 20
#> 3 0.344 0.282 13 132 1320 SVP 100 40
#> 4 0.040 0.000 13 132 1320 SVP 100 5
#> 5 0.670 0.590 13 132 1320 SVP 100 60
#> 6 0.040 0.000 13 132 1320 SVP 100 8
#> 7 0.040 0.000 13 132 1320 SVP 100 1
#> 8 0.040 0.000 13 132 1320 SVP 100 100
#> 9 0.040 0.000 13 132 1320 SVP 100 0
#> 10 1.000 0.000 15 152 1520 SVP 100 4
#> 11 1.000 0.000 15 152 1520 SVP 100 20
#> 12 1.000 0.000 15 152 1520 SVP 100 40
#> 13 1.000 0.000 15 152 1520 SVP 100 5
#> 14 1.000 0.000 15 152 1520 SVP 100 60
#> 15 1.000 0.000 15 152 1520 SVP 100 8
#> 16 1.000 0.000 15 152 1520 SVP 100 1
#> 17 1.000 0.000 15 152 1520 SVP 100 100
#> 18 1.000 0.000 15 152 1520 SVP 100 0
#> 19 0.189 0.000 11 113 1130 SVP 100 4
#> 20 0.189 0.000 11 113 1130 SVP 100 20
#> req.cat known.val weight
#> 1 SVP 0 0.5
#> 2 SVP 1 1.0
#> 3 SVP 1 1.0
#> 4 SVP 0 0.5
#> 5 SVP 1 1.0
#> 6 SVP 0 0.5
#> 7 SVP 0 0.5
#> 8 SVP 0 0.5
#> 9 SVP 0 0.5
#> 10 SVP 0 0.5
#> 11 SVP 0 0.5
#> 12 SVP 0 0.5
#> 13 SVP 0 0.5
#> 14 SVP 0 0.5
#> 15 SVP 0 0.5
#> 16 SVP 0 0.5
#> 17 SVP 0 0.5
#> 18 SVP 0 0.5
#> 19 SVP 0 0.5
#> 20 SVP 0 0.5
Next, we run through an example of model tuning. Our approach breaks this down into two stages: train-test, and k-folds cross validation. Both stages utilize an iterative modeling approach. In the train-test stage, model tuning is accomplished using only the known data. In order to mimic prediction, a subset of the data (the test set) is thrown out at the beginning of the procedure to create mock missing data. In the k-folds stage, both the known and the missing data are used. The test fold is selected from the known data, creating mock missing data in the same manner as in the train-test stage.
The train-test paradigm was used to determine the model parameters to use moving forward, specifically the parameters nrounds, max_depth, and eta. In practice, this procedure was run multiple times until we optimized the parameters. Here, we simply run it once with the final parameterization that was obtained. The data are first separated into train/test/validate sets. Then, initial guesses are populated for the test set, followed by iterative model prediction.
model.results <- iterateModelTT(ors.data=ors.data,
n.iter=5,weight.step=0.05,
mdl.d=14,mdl.n=200,mdl.e=0.6)
#> [1] "Iteration 0..."
#> [1] "Generating test/train/validate sets..."
#> [1] "Generating guesses in test set..."
#> [1] "Iteration 1..."
#> [1] "Evaluating model..."
#> [1] "Iteration 2..."
#> [1] "Evaluating model..."
#> [1] "Iteration 3..."
#> [1] "Evaluating model..."
#> [1] "Iteration 4..."
#> [1] "Evaluating model..."
#> [1] "Iteration 5..."
#> [1] "Evaluating model..."
Next, we use k-folds cross validation (KFCV) to determine the blending ratio for our ensemble method. This approach first uses the smartGuess() function to generate an initial prediction for the missing values (both actual and mock), and then uses an iterative modeling procedure to refine these predictions. The full process involves running the KFCV paradigm twice: once using SOC2 codes to guide smart guessing, and once using SOC3 codes. The results of these models are then blended using a ratio calculated from the convergence iterations of the individual models.
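To make the idea concrete, here is a hypothetical sketch of what SOC-guided smart guessing could look like (the actual smartGuess() implementation may differ): a missing estimate is initialized with the mean of the known estimates for the same requirement level among occupations sharing the given SOC prefix.

```r
# Hypothetical smart-guess sketch: borrow the mean of known peers that
# share the SOC prefix and the requirement level (toy data, assumed logic)
smart.guess <- function(df, soc.col = "upSOC2") {
  for (i in which(is.na(df$value))) {
    peers <- df$value[!is.na(df$value) &
                      df[[soc.col]] == df[[soc.col]][i] &
                      df$data_type_text == df$data_type_text[i]]
    if (length(peers) > 0) df$value[i] <- mean(peers)
  }
  df
}

toy <- data.frame(upSOC2 = c("13", "13", "13"),
                  data_type_text = c("YES", "YES", "YES"),
                  value = c(0.6, 0.7, NA),
                  stringsAsFactors = FALSE)
smart.guess(toy)$value   # the NA is initialized to 0.65
```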
K-folds Cross Validation Modeling. During the modeling stage, the predictions are initialized using smart guessing and then refined in an iterative fashion that leverages XGBoost. Once this is done using both the SOC2 and the SOC3 codes to guide smart guessing, the predictions for the test folds from each iteration are collated to generate a “full” predicted dataset for each iteration. A sample of the collation results from the SOC2 model is shown below, with Prediction0 corresponding to the output of the smart guess procedure, and all subsequent predictions corresponding to those generated by the iterative modeling process. Note that in these results, the first column fold indicates which fold a given observation belongs to, the second column req.cat indicates the requirement category of the observation, and the third column actual gives the actual value of the estimate (before it was thrown out).
# Use SOC3 codes to guide smart guessing, and model
soc3.mdl <- iterateModelKFCV(ors.data=ors.data,
n.iter=10,weight.step=0.05,
mdl.d=14,mdl.n=200,mdl.e=0.6,
fold.list=NULL,sg.soc.code="upSOC3")
#> [1] "Iteration 0..."
#> [1] "Generating folds..."
#> [1] "Iteration 1..."
#> [1] "Iteration 2..."
#> [1] "Iteration 3..."
#> [1] "Iteration 4..."
#> [1] "Iteration 5..."
#> [1] "Iteration 6..."
#> [1] "Iteration 7..."
#> [1] "Iteration 8..."
#> [1] "Iteration 9..."
#> [1] "Iteration 10..."
# Use SOC2 codes to guide smart guessing, and model
# (Use the folds generated during the SOC3 phase)
fl <- soc3.mdl$folds
soc2.mdl <- iterateModelKFCV(ors.data=ors.data,
n.iter=10,weight.step=0.05,
mdl.d=14,mdl.n=200,mdl.e=0.6,
fold.list=fl,sg.soc.code="upSOC2")
#> [1] "Iteration 0..."
#> [1] "Using provided folds..."
#> [1] "Iteration 1..."
#> [1] "Iteration 2..."
#> [1] "Iteration 3..."
#> [1] "Iteration 4..."
#> [1] "Iteration 5..."
#> [1] "Iteration 6..."
#> [1] "Iteration 7..."
#> [1] "Iteration 8..."
#> [1] "Iteration 9..."
#> [1] "Iteration 10..."
# Get test fold data
test.folds2 <- getTestFoldData(soc2.mdl)
test.folds3 <- getTestFoldData(soc3.mdl)
head(test.folds2)
#> fold req.cat actual
#> Accountants10>1YEAR,=2YEARS 3 SVP 0.017
#> Accountants10>2YEARS,=4YEARS 3 SVP 0.313
#> Accountants10>4YEARS,=10YEARS 6 SVP 0.630
#> AdministrativeServicesManagers10>2YEARS,=4YEARS 9 SVP 0.178
#> AdministrativeServicesManagers10>4YEARS,=10YEARS 7 SVP 0.633
#> AdvertisingandPromotionsManagers10>4YEARS,=10YEARS 4 SVP 0.568
#> Prediction0 Prediction1
#> Accountants10>1YEAR,=2YEARS 0.02456839 0.02275250
#> Accountants10>2YEARS,=4YEARS 0.05925896 0.06338542
#> Accountants10>4YEARS,=10YEARS 0.18757081 0.18759358
#> AdministrativeServicesManagers10>2YEARS,=4YEARS 0.04587500 0.06337862
#> AdministrativeServicesManagers10>4YEARS,=10YEARS 0.10275000 0.31418519
#> AdvertisingandPromotionsManagers10>4YEARS,=10YEARS 0.70700000 0.71037892
#> Prediction2 Prediction3
#> Accountants10>1YEAR,=2YEARS 0.01997569 0.01899637
#> Accountants10>2YEARS,=4YEARS 0.06987788 0.07096555
#> Accountants10>4YEARS,=10YEARS 0.19503461 0.20154545
#> AdministrativeServicesManagers10>2YEARS,=4YEARS 0.07842219 0.09856568
#> AdministrativeServicesManagers10>4YEARS,=10YEARS 0.39649393 0.46587905
#> AdvertisingandPromotionsManagers10>4YEARS,=10YEARS 0.70914092 0.70320724
#> Prediction4 Prediction5
#> Accountants10>1YEAR,=2YEARS 0.01825424 0.01751429
#> Accountants10>2YEARS,=4YEARS 0.07141774 0.07495140
#> Accountants10>4YEARS,=10YEARS 0.19663334 0.19949098
#> AdministrativeServicesManagers10>2YEARS,=4YEARS 0.10032029 0.11351600
#> AdministrativeServicesManagers10>4YEARS,=10YEARS 0.49373051 0.52556225
#> AdvertisingandPromotionsManagers10>4YEARS,=10YEARS 0.68931010 0.68215677
#> Prediction6 Prediction7
#> Accountants10>1YEAR,=2YEARS 0.01959155 0.01536126
#> Accountants10>2YEARS,=4YEARS 0.08175797 0.09033387
#> Accountants10>4YEARS,=10YEARS 0.20546383 0.20568970
#> AdministrativeServicesManagers10>2YEARS,=4YEARS 0.10088023 0.10014714
#> AdministrativeServicesManagers10>4YEARS,=10YEARS 0.51506374 0.51746200
#> AdvertisingandPromotionsManagers10>4YEARS,=10YEARS 0.67802113 0.67973847
#> Prediction8 Prediction9
#> Accountants10>1YEAR,=2YEARS 0.01386481 0.01033710
#> Accountants10>2YEARS,=4YEARS 0.09518533 0.09667799
#> Accountants10>4YEARS,=10YEARS 0.20864541 0.21316246
#> AdministrativeServicesManagers10>2YEARS,=4YEARS 0.09679408 0.09572176
#> AdministrativeServicesManagers10>4YEARS,=10YEARS 0.53426800 0.53710864
#> AdvertisingandPromotionsManagers10>4YEARS,=10YEARS 0.67523263 0.67540517
#> Prediction10
#> Accountants10>1YEAR,=2YEARS 0.009218727
#> Accountants10>2YEARS,=4YEARS 0.096899032
#> Accountants10>4YEARS,=10YEARS 0.220977512
#> AdministrativeServicesManagers10>2YEARS,=4YEARS 0.094293299
#> AdministrativeServicesManagers10>4YEARS,=10YEARS 0.537543921
#> AdvertisingandPromotionsManagers10>4YEARS,=10YEARS 0.672052267
Model Convergence. After running the models, we determine the iteration at which they converge. The predictions from this convergence iteration (calculated for both the SOC2 and SOC3 models) are eventually used to determine the blending ratio for imputing missing values at a later step (see 4.2). Convergence is defined as the point at which the difference in root mean square error (RMSE) between consecutive predictions is < 0.001. Note that in this vignette, because we are only using a subset of the data, the models never meet this convergence criterion. In this situation, the last iteration is returned as the convergence iteration (iteration 10, for both models).
computeConvergence(test.folds2)
#> RMSE by iteration:
#> Prediction0 Prediction1 Prediction2 Prediction3 Prediction4 Prediction5
#> 0.16333450 0.14508097 0.13534668 0.12786482 0.12147780 0.11619503
#> Prediction6 Prediction7 Prediction8 Prediction9 Prediction10
#> 0.11170793 0.10764041 0.10411858 0.10070571 0.09774716
#>
#> Differences in RMSE:
#> 0.01825352 0.009734287 0.007481862 0.00638702 0.005282773 0.0044871 0.004067521 0.003521825 0.003412872 0.002958549
Figure 3.1: RMSE by iteration of SOC2 model.
#> Convergence criteria unmet, returning final iteration
#> [1] 10
computeConvergence(test.folds3)
#> RMSE by iteration:
#> Prediction0 Prediction1 Prediction2 Prediction3 Prediction4 Prediction5
#> 0.17821356 0.14008939 0.13192531 0.12580777 0.12054885 0.11594333
#> Prediction6 Prediction7 Prediction8 Prediction9 Prediction10
#> 0.11163077 0.10795266 0.10427675 0.10124913 0.09818892
#>
#> Differences in RMSE:
#> 0.03812417 0.008164084 0.006117544 0.00525891 0.004605525 0.004312562 0.003678107 0.003675912 0.003027623 0.00306021
Figure 3.2: RMSE by iteration of SOC3 model.
#> Convergence criteria unmet, returning final iteration
#> [1] 10
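The convergence rule itself is easy to state in code. Below is a sketch of the assumed logic behind computeConvergence(), applied to the SOC2 RMSE values printed above.

```r
# Converge at the first iteration whose improvement in RMSE over the
# previous iteration falls below the tolerance; otherwise return the
# final iteration (as happens for this data subset)
convergence.iteration <- function(rmse.by.iter, tol = 0.001) {
  improvements <- -diff(rmse.by.iter)
  hit <- which(improvements < tol)
  if (length(hit) > 0) hit[1] else length(rmse.by.iter) - 1
}

rmse.soc2 <- c(0.16333450, 0.14508097, 0.13534668, 0.12786482, 0.12147780,
               0.11619503, 0.11170793, 0.10764041, 0.10411858, 0.10070571,
               0.09774716)
convergence.iteration(rmse.soc2)   # all improvements exceed 0.001, so 10
```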
Determine Blending Ratio. After determining the convergence iteration of each model, we run an analysis to optimize the contribution of each model to the final prediction. The predictions from the two convergence iterations are blended in a ratio chosen to minimize the RMSE. From this analysis we arrived at a blending ratio of 51:49 (SOC2:SOC3). See below for the RMSE of predictions from each of the two analysis streams at the convergence iteration, as well as from the blended model. Figure 3.3 illustrates the blending ratio analysis. We report on this blended result in the next section. Importantly, this ensemble approach, and the ratio derived here, are also used throughout the remaining prediction steps, namely in 4.2.
blending.data <- computeBlendingRatio(soc2.mdl,soc3.mdl,print.plot=FALSE)
#> Computing convergence for model 1...
#> Convergence criteria unmet, returning final iteration
#>
#> Computing convergence for model 2...
#> Convergence criteria unmet, returning final iteration
#>
#> Convergence iteration for model 1: 10
#> Convergence iteration for model 2: 10
blending.data$soc2.proportion
#> [1] 0.51
blending.data$soc3.proportion
#> [1] 0.49
rmse(test.folds2$actual,test.folds2$Prediction10)
#> [1] 0.09774716
rmse(test.folds3$actual,test.folds3$Prediction10)
#> [1] 0.09818892
rmse(blending.data$test.folds.blended$actual,blending.data$test.folds.blended$Prediction10)
#> [1] 0.09414974
Figure 3.3: Calculation of model blending ratio. The predictions at the convergence iteration of each of the constituent models were combined in various ratios, and the RMSE of the resulting blended prediction was calculated. The ratio that minimized this RMSE was 51(SOC2):49(SOC3), represented by the red point on the plot.
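The ratio search can be sketched as a simple grid minimization (an assumed form of the analysis behind computeBlendingRatio(); rmse() is defined locally so the block is self-contained):

```r
# Blend two prediction vectors over a grid of mixing proportions and
# keep the proportion that minimizes RMSE against the actual values
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))

best.blend <- function(actual, pred.soc2, pred.soc3, step = 0.01) {
  props <- seq(0, 1, by = step)
  errs  <- sapply(props, function(p)
    rmse(actual, p * pred.soc2 + (1 - p) * pred.soc3))
  c(soc2.proportion = props[which.min(errs)], min.rmse = min(errs))
}

# Toy example: the two models err in opposite directions, so an even
# blend is optimal
actual <- c(0.2, 0.5, 0.8)
best.blend(actual, actual + 0.01, actual - 0.01)
```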
Final Ensemble and Plot of Iterative Process. A visualization of the iterative process can be seen in Fig. 3.4. This visualization uses the blended, ensemble data, calculated from the ratio derived in the previous section. The actual value versus prediction is plotted to show the progression from the initial smart guess to the (adjusted) prediction for each iteration of the model. With each iteration, XGBoost moves the predictions toward their actual values: predictions are swept toward the diagonal at each step. Note that the blue points, representing the SVP requirement category, which contains the requirement with the highest number of possible observations (Minimum education level, with 9 levels), have the furthest to move and the highest error. Keep in mind that because this vignette only addresses a subset of the data, the convergence to the actual values is less pronounced than it would be with the full dataset.
# Plot progression of (blended) predictions
p23 <- plotTestFolds(blending.data$test.folds.blended,print.plot=FALSE)
ggpubr::ggarrange(plotlist=p23,nrow=ceiling(length(p23)/2),ncol=2,labels="auto",legend="none")
Figure 3.4: Predicted estimates vs. their actual values by iteration. We can see a convergence of the predictions toward their actual values. Grey corresponds to ENV, yellow to PHY, and cyan to SVP.
Following model tuning, we address the original problem: imputation. This is carried out using the same iterative method utilized during tuning, operating on both the known and the missing data. Here, however, there is no mock missing data. As in the KFCV approach, the (actual) missing data are initialized with a prediction using smart guessing, and XGBoost models are then used to iteratively refine these predictions. However, it is important to capture some of the variability inherent in the data within the framework of imputation. To do this, we first create a set of simulated datasets derived from the original known data. These are then used moving forward to generate a distribution of predictions for each of the missing values, allowing for the calculation of confidence intervals rather than a single point estimate. The details of this procedure are demonstrated below.
Simulated data is drawn from a \(\beta\) distribution using, for all known observations, the value field (\(\mu\)) for the mean and the std.error field (\(\sigma_e\)) as the standard deviation. The details of this calculation can be found in the full paper. As a result of this step, each known observation has, in addition to its original value, 10 simulated values associated with it, resulting in 10 simulated datasets.
The newly generated datasets were then run through the imputation process and each missing value produced a range of predictions, generating both a mean estimate and a distribution thereof. In this way, missing values that were more sensitive to the initial assumptions show wider variation in their range of predictions, leading to broader confidence intervals for the imputed missing values. The details of this analysis are described in the following section, 4.2.
ors.data.sims <- computeSimulations(ors.data)
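A method-of-moments parameterization is one way to obtain Beta shape parameters from \(\mu\) and \(\sigma_e\); the paper's exact derivation may differ, so treat the sketch below as illustrative.

```r
# Solve mean = a/(a+b) and variance = mu*(1-mu)/(a+b+1) for the shape
# parameters (method of moments); requires sigma^2 < mu*(1-mu)
beta.params <- function(mu, sigma) {
  nu <- mu * (1 - mu) / sigma^2 - 1   # nu = a + b
  c(shape1 = mu * nu, shape2 = (1 - mu) * nu)
}

# Example: a known estimate of 0.313 with standard error 0.031
p <- beta.params(0.313, 0.031)
set.seed(1)
sims <- rbeta(10, p[["shape1"]], p[["shape2"]])  # 10 simulated values
```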
Recall that the known values are simulated 10 times, per the procedure outlined in 4.1. The result is 10 distinct datasets (simulations) on which to perform imputation. For each simulation, the data are first subjected to the smart guessing procedure. Then the full dataset (known and smart-guessed estimates) is fed into an XGBoost fit, and the resulting model is used to predict the missing values. Note that, as in the KFCV step, two models are produced in this way: one using the SOC2 codes to guide smart guessing, and one using the SOC3 codes. These models are then blended using the convergence iterations and the ratios computed in 3.2. See the code and output below for details.
# Iterative imputing for SOC2 model
impute.soc2 <- iterateModel(ors.data.sims=ors.data.sims,n.iter=10,
weight.step=0.05,sg.soc.code="upSOC2")
#> [1] "Smart guessing for simulation 1 (Iteration 0)..."
#> [1] "Smart guessing for simulation 2 (Iteration 0)..."
#> [1] "Smart guessing for simulation 3 (Iteration 0)..."
#> [1] "Smart guessing for simulation 4 (Iteration 0)..."
#> [1] "Smart guessing for simulation 5 (Iteration 0)..."
#> [1] "Smart guessing for simulation 6 (Iteration 0)..."
#> [1] "Smart guessing for simulation 7 (Iteration 0)..."
#> [1] "Smart guessing for simulation 8 (Iteration 0)..."
#> [1] "Smart guessing for simulation 9 (Iteration 0)..."
#> [1] "Smart guessing for simulation 10 (Iteration 0)..."
#> [1] "-------ITERATING MODELS-------"
# Iterative imputing for SOC3 model
impute.soc3 <- iterateModel(ors.data.sims=ors.data.sims,n.iter=10,
weight.step=0.05,sg.soc.code="upSOC3")
#> [1] "Smart guessing for simulation 1 (Iteration 0)..."
#> [1] "Smart guessing for simulation 2 (Iteration 0)..."
#> [1] "Smart guessing for simulation 3 (Iteration 0)..."
#> [1] "Smart guessing for simulation 4 (Iteration 0)..."
#> [1] "Smart guessing for simulation 5 (Iteration 0)..."
#> [1] "Smart guessing for simulation 6 (Iteration 0)..."
#> [1] "Smart guessing for simulation 7 (Iteration 0)..."
#> [1] "Smart guessing for simulation 8 (Iteration 0)..."
#> [1] "Smart guessing for simulation 9 (Iteration 0)..."
#> [1] "Smart guessing for simulation 10 (Iteration 0)..."
#> [1] "-------ITERATING MODELS-------"
# Blend models using KFCV results
blended.model <- blendImputations(model.results.soc2=impute.soc2,
model.results.soc3=impute.soc3,
conv.iter.soc2=blending.data$soc2.conv.iter,
conv.iter.soc3=blending.data$soc3.conv.iter,
soc2.prop=blending.data$soc2.proportion,
soc3.prop=blending.data$soc3.proportion,
ors.data.sims)
#> [1] "Calculating mean prediction at iteration 10"
#> [1] "Calculating mean prediction at iteration 10"
After imputing over each of the simulations, we arrive at a distribution of 10 predictions per missing value. This allows us to calculate a mean prediction and its relevant 95% confidence interval for each of the missing values, a result which is more useful than a single point estimate. In Table 4.1 we provide a sample of these calculated confidence intervals (and associated mean predictions).
model.CIs <- computeCIs(blended.model)
CIs <- model.CIs$CIs
CIs <- cbind(ors.data.sims[rownames(CIs),c("occupation_text","data_element_text","data_type_text")],CIs)
| Occupation | Requirement | Requirement level | Mean prediction | CI (lower) | CI (upper) |
|---|---|---|---|---|---|
| Administrative Services Managers | SVP | > 1 MONTH, <= 3 MONTHS | 0.0078362 | 0.0061895 | 0.0094830 |
| Public Relations and Fundraising Managers | Reaching overhead | SELDOM | 0.0385467 | 0.0260287 | 0.0510646 |
| Administrative Services Managers | Lifting/carrying Occasionally | NONE | 0.3711352 | 0.3388069 | 0.4034634 |
| Chief Executives | Pushing/pulling: Hands/arms | FREQUENTLY | 0.0026718 | 0.0017270 | 0.0036167 |
| Storage and Distribution Managers | Gross manipulation | OCCASIONALLY | 0.6072535 | 0.5637946 | 0.6507124 |
| Education Administrators, Postsecondary | Kneeling | OCCASIONALLY | 0.0076856 | 0.0033822 | 0.0119890 |
| Compliance Managers | High, exposed places | SELDOM | 0.0047139 | 0.0038372 | 0.0055907 |
| Sales Managers | Hearing requirements: Other Sounds | YES | 0.0615373 | 0.0570968 | 0.0659777 |
| Education Administrators, Postsecondary | Minimum education level | HIGH SCHOOL VOCATIONAL | 0.0001393 | -0.0000558 | 0.0003344 |
| Credit Analysts | Reaching overhead | NOT PRESENT | 0.9596998 | 0.9538858 | 0.9655137 |
| Regulatory Affairs Specialists | Lifting/carrying Seldom | >1 LBS, <=10 POUNDS | 0.5414520 | 0.5206967 | 0.5622072 |
| Insurance Underwriters | Crawling | CONSTANTLY | 0.0007694 | 0.0004728 | 0.0010659 |
| Claims Examiners, Property and Casualty Insurance | Hazardous contaminants | SELDOM | 0.0021681 | 0.0016925 | 0.0026437 |
| Compensation, Benefits, and Job Analysis Specialists | Stooping | CONSTANTLY | 0.0319217 | 0.0289224 | 0.0349210 |
| Credit Analysts | Heavy vibrations | OCCASIONALLY | 0.0004232 | 0.0002293 | 0.0006171 |
| Logistics Analysts | Hearing requirements: Group or conference | NO | 0.0897224 | 0.0790936 | 0.1003511 |
| Market Research Analysts and Marketing Specialists | Strength | HEAVY WORK | 0.0093089 | 0.0064523 | 0.0121656 |
| Computer Programmers | Reaching overhead | OCCASIONALLY | 0.0265749 | 0.0105536 | 0.0425963 |
| Information Technology Project Managers | Lifting/carrying Frequently | >1 LBS, <=10 POUNDS | 0.0247576 | 0.0143199 | 0.0351953 |
| Database Administrators | Gross manipulation | NOT PRESENT | 0.0263979 | 0.0057709 | 0.0470249 |
| Information Technology Project Managers | Extreme cold | FREQUENTLY | 0.0017380 | 0.0013339 | 0.0021420 |
| Actuaries | Hearing requirements: Group or conference | YES | 0.9629114 | 0.9555136 | 0.9703092 |
| Environmental Scientists and Specialists, Including Health | SVP | > 6 MONTHS, <= 1 YEAR | 0.0203178 | 0.0161228 | 0.0245127 |
| Life, Physical, and Social Science Occupations | Pushing/pulling: Hands/arms | OCCASIONALLY | 0.0102104 | 0.0049567 | 0.0154641 |
| Life, Physical, and Social Science Occupations | Hazardous contaminants | SELDOM | 0.1625361 | 0.1463737 | 0.1786985 |
| Chemical Technicians | Minimum education level | PROFESSIONAL | 0.0394071 | 0.0335857 | 0.0452285 |
Uncertainty in the data itself is captured using the simulation technique described above. However, the models themselves are also a source of uncertainty. We note that during imputation, the models predict not only the missing values but also the known ones. With each iteration, the known observations are reset to their actual values before predicting in the subsequent iteration, but the original predictions are retained to roughly quantify the uncertainty contributed by the models. To do so, we calculate both the mean absolute error (MAE) and the mean error (ME) between the actual values of known observations and the predictions generated by the model, for each of the 10 simulations. At convergence (iteration 10 for all 10 simulations), the average (mean) MAE across all 10 simulations is 0.001934767, and the average ME is 6.421957e-05.
model.uncert <- computeModelUncertainty(impute.soc2,impute.soc3,10,10,0.51,0.49)
model.uncert$avg.mae
#> [1] 0.001934767
model.uncert$avg.me
#> [1] 6.421957e-05
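The MAE and ME reported by `computeModelUncertainty()` can be illustrated with a minimal sketch; the `actual` and `pred` vectors below are invented for the example, not drawn from the ORS data:

```r
# Hypothetical actual values for a handful of known observations, and the
# model's predictions for those same observations in one simulation
actual <- c(0.95, 0.10, 0.40, 0.02)
pred   <- c(0.94, 0.12, 0.39, 0.03)

mae <- mean(abs(actual - pred))  # mean absolute error
me  <- mean(actual - pred)       # mean (signed) error

mae
#> [1] 0.0125
me
#> [1] -0.0025
```

A near-zero ME alongside a larger MAE, as in the converged models above, indicates that the models' errors are small and largely cancel rather than biasing predictions in one direction.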
The values imputed for the missing estimates in Section 4 can be used to extract and extrapolate insights from the data captured in the ORS. We address a handful of such examples, and the relevant functions in the imputeORS package, below.
The “Expected Level of Effort” (ELE) is a measure we derived to compare occupations. It is a weighted average in which each requirement's frequency and intensity are weighted by its population estimate. Low frequency, intensity, and population estimates yield a low expected level of effort; high values yield a high one.
For each occupational group, we compute ELE as an expected value \(E\) of frequency times intensity as follows, where \(\mu_j\) is the mean population prediction across all 10 simulations for the \(j^{th}\) observation, and \(F_j\) and \(I_j\) are the frequency and intensity of the \(j^{th}\) observation, respectively:
\[E=\sum_j{\mu_j*F_j*I_j}\]
expected.vals <- computeEVs(blended.model)
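As a toy illustration of the expected value above, consider a single hypothetical requirement with three frequency levels; the mean predictions, frequency scores, and intensity scores below are invented for the example, not package output:

```r
# Hypothetical mean population predictions for the three levels (sum to 1),
# with a frequency score and an intensity score assigned to each level
mu     <- c(0.70, 0.25, 0.05)
freq   <- c(0.1, 0.5, 0.9)
intens <- c(1, 2, 3)

# ELE is the population-weighted sum of frequency times intensity
ELE <- sum(mu * freq * intens)
ELE
#> [1] 0.455
```

Mass concentrated on low-frequency, low-intensity levels keeps the expected value small; shifting population weight toward the higher levels would raise it.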
This measure can be used in a correlation calculation across occupations, as in Fig. 5.1 below. Further, it can be used to assess job similarity between different occupations (see 5.2.2 for more details). ELE can also be applied to improve survey design, but this is beyond the scope of this vignette (see the full paper for further details).
htmp <- correlationPlot(blended.model,plot.dim=3,print.plot=FALSE)
htmp$plot.pub
Figure 5.1: Heatmap of occupational correlation of ELE, organized by 2-digit SOC code. Intra-group correlation (major diagonal) is generally higher than inter-group correlation.
One of our proposed applications of the (imputed) ORS data is computing the similarity between different occupations. We describe two separate methods for doing so, detailed below in 5.2.1 and 5.2.2.
Imputation generates a full distribution of population percentages for all the considered occupations. Thus, for any two occupations, we can measure their overlap on a given requirement by summing the products of their level weightings, which yields a value on the interval \([0,1]\). Further analysis of the requirements and their overlaps can help identify which jobs are similar or dissimilar.
If \(\omega_{1ir}\) is the population mean of the \(i^{th}\) level of the \(r^{th}\) requirement for Job 1 (average of simulation predictions for missing values, and actual value for known observations), and \(\omega_{2ir}\) is the same for Job 2, then we say that the overlap of the \(r^{th}\) requirement (\(O_r\)) for these two jobs is:
\[O_r = \sum_{i}^{} \omega_{1ir}*\omega_{2ir}\] We demonstrate three overlap comparisons below, visualized in Figs. 5.2-5.4.
job.ovrlp1 <- computeOverlap(blended.model,"Accountants","Web Developers")
job.ovrlp2 <- computeOverlap(blended.model,"Accountants","Actuaries")
job.ovrlp3 <- computeOverlap(blended.model,"Accountants","Chemists")
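The overlap formula can be sketched for a single hypothetical requirement; the level weightings below are invented for the example, not output of `computeOverlap()`:

```r
# Hypothetical level weightings for one requirement; each occupation's
# weightings across levels sum to 1
w.job1 <- c(NOT.PRESENT = 0.6, SELDOM = 0.3, OCCASIONALLY = 0.1)
w.job2 <- c(NOT.PRESENT = 0.5, SELDOM = 0.2, OCCASIONALLY = 0.3)

# Overlap is the sum of level-wise products, a value on [0, 1]
O.r <- sum(w.job1 * w.job2)
O.r
#> [1] 0.39
```

Two occupations with identical, concentrated weightings would approach an overlap of 1, while occupations whose mass sits on disjoint levels would approach 0.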
Figure 5.2: Overlap of Accountants and Web Developers. Mean +/- SD (dashed lines) assumes equal weighting of captured activity.
Figure 5.3: Overlap of Accountants and Actuaries. Mean +/- SD (dashed lines) assumes equal weighting of captured activity.
Figure 5.4: Overlap of Accountants and Chemists. Mean +/- SD (dashed lines) assumes equal weighting of captured activity.
Previously, in 5.1, we established a metric by which occupations could be compared across requirements in a correlation analysis. In Fig. 5.5, we standardize these expected values and plot their distributions by requirement. On the far right we have aggregated all the additive groups corresponding to the requirement Lifting/Carrying (24, 25, 26, and 27) into a single distribution labeled LC. The same was done for Reaching (additive groups 16 and 18), labeled R.
# Standardize EVs
std.EVs <- standardizeEVs(blended.model)
# Differences in standardized EVs
stdEV.diff1 <- computeStdEVdiff(blended.model,"Accountants","Web Developers")
stdEV.diff2 <- computeStdEVdiff(blended.model,"Accountants","Actuaries")
stdEV.diff3 <- computeStdEVdiff(blended.model,"Accountants","Chemists")
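The standardization performed by `standardizeEVs()` can be sketched, under the assumption that it amounts to a z-score within each additive group (requirement); the ELE values below are hypothetical:

```r
# Hypothetical ELEs for five occupations on a single requirement
E <- c(0.45, 0.50, 0.55, 0.60, 0.90)

# Standardize within the requirement: center on the mean, scale by the SD
E.std <- (E - mean(E)) / sd(E)

# The difference in standardized ELEs between two occupations (here the
# first and fifth) is then a simple subtraction
E.std[5] - E.std[1]
```

Centering places the mean ELE at 0, consistent with the red points in Fig. 5.5, and the pairwise differences are the quantities plotted in Figs. 5.6-5.8.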
Figure 5.5: Boxplot of ELEs, standardized by additive group (requirement). Red points indicate standardized \(\mu_E\)'s (i.e., 0), and blue points indicate outliers. Requirements where \(\mu_E\) and the median do not align are indicative of skew in the expected values.
We note that these standardized ELEs provide another avenue for comparing job similarity, specifically by taking the difference in standardized ELEs between two occupations. Below are three figures demonstrating the differences between standardized expected values by occupation. Specifically, we visualize the same three pairs that were addressed in the overlap analysis, namely Accountants vs. Web Developers (Fig. 5.6), Accountants vs. Actuaries (Fig. 5.7), and Accountants vs. Chemists (Fig. 5.8).
Figure 5.6: Difference in standardized ELEs between Accountants and Web Developers.
Figure 5.7: Difference in standardized ELEs between Accountants and Actuaries.
Figure 5.8: Difference in standardized ELEs between Accountants and Chemists.
In this vignette we demonstrate the use of the imputeORS package, a collection of methods by which to impute values for missing estimates within the Occupational Requirements Survey, administered by the U.S. Bureau of Labor Statistics. Our method leverages many of the inherent features of the data to arrive at a distribution of predictions. These features include the bounded nature of the data (all observations must be on the interval [0, 1]), negative correlation within an occupational group (percentages that must sum to 100%), quantified errors of known observations, and similarity between different occupations within a 2- or 3-digit SOC group.
We use the known errors to generate 10 simulations of the known estimates in order to produce a distribution of predictions, which ultimately allows us to compute confidence intervals for our imputations rather than single point estimates. We then take an iterative approach using XGBoost to generate our predictions for each simulation, iterating the models until they reach convergence. The predictions from the convergence iteration of each simulation are used to calculate a mean prediction and its 95% confidence interval.
Finally, we demonstrate a number of applications of the imputed values. Further details of the methods and applications described in this vignette can be found in the full paper. The full paper also describes a generalized, iterative imputation algorithm based on the procedures implemented in the imputeORS package.